Some contents and inspiration are from the Rmarkdown files provided in class by Javier Nogales.
In the next cell, the packages that will be used throughout the document are loaded.
library(tidyverse)
library(lubridate)
library(tsibble)
library(tidymodels)
library(modeltime)
library(modeltime.ensemble)
library(timetk)
library(Quandl)
library(urca)
library(forecast)
library(bigtime)
The data set is obtained from Blockchain, a website that publishes Bitcoin-related data, updated daily. It is published on Quandl, so we can access it through an API.
The goal of the project is to predict the Bitcoin market price, using past values of the series itself as well as other variables. These predictors were selected beforehand, on the grounds that they could ease the prediction of future values. The selected predictors are:
- volume: Bitcoin USD Exchange Trade Volume.
- trans: Bitcoin Number of Transactions.
- trans_cost: Bitcoin Cost Per Transaction.
The target is:
- price: Bitcoin Market Price USD.
Once the target and the predictors are identified, we need to look up the token of each time series. They will later be renamed according to the variable names:
tokens = c("BCHAIN/MKPRU", # Bitcoin Market Price USD
"BCHAIN/TRVOU", # Bitcoin USD Exchange Trade Volume
"BCHAIN/NTRAN", # Bitcoin Number of Transactions
"BCHAIN/CPTRA" # Bitcoin Cost Per Transaction
)
names = c("date", "price", "volume", "trans", "trans_cost")
With the corresponding tokens, we can call the API and collect the data. In this case, data from 2021-01-01 to 2022-03-18 is used. The frequency is daily, so this interval provides enough samples.
Quandl.api_key("7ee5wB3N-fu3EqTcth8L") # in case you exceed the number of API requests
data = Quandl(tokens, type="raw",
start_date = "2021-01-01",
end_date = "2022-03-18")
bitcoin = as_tibble(data)
A loop is used to iterate over the columns and rename them so that they are easier to understand.
for (i in 1:length(names)){
colnames(bitcoin)[i] = names[i]
}
Some functions and visualization tools require having all the series stacked as rows, identified by the corresponding variable name (long format).
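As a side note, the renaming loop can be replaced by a single call; a minimal base-R sketch using a toy data frame (hypothetical values, not the actual Quandl result):

```r
# Toy data frame standing in for the downloaded data (hypothetical values)
raw = data.frame(Date = 1:3, V1 = 4:6, V2 = 7:9)

# Rename all the columns at once instead of looping over them
new_names = c("date", "price", "volume")
renamed = setNames(raw, new_names)

colnames(renamed)  # "date" "price" "volume"
```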
bitcoin_by_var = pivot_longer(bitcoin, c(2:length(names)),
names_to = "var",
values_to = "value")
Finally, we are able to visualize the multiple time series.
bitcoin_by_var %>%
ggplot(aes(x = date, y = value)) +
geom_line() +
labs(title = "Daily Values",
x = "", y="variables") +
facet_grid(vars(var), scales = "free_y") +
theme_bw()
Before applying any statistical or machine learning model, it is advisable to apply some preprocessing steps and to check whether any data transformation is required.
First, we have to check if there is any missing value.
bitcoin_by_var %>%
summarise(na = sum(is.na(value)))
## # A tibble: 1 × 1
## na
## <int>
## 1 0
In this case, there are no missing values.
With the ACF and PACF plots of the variables, we can detect whether the data are stationary and whether there is any seasonal pattern.
bitcoin_by_var %>%
group_by(var) %>%
plot_acf_diagnostics(date, value, .lags = 14)
As can be seen, the target price does not present any seasonal pattern. However, the ACF plot reveals high correlation between lags, although only the first and second spikes are relevant in the PACF. For this reason, a first difference of this variable could be worthwhile.
In the remaining variables, we can appreciate a seasonal pattern every 7 days, i.e. weekly seasonality. This may call for a seasonal difference.
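The effect of a weekly seasonal difference can be illustrated on a synthetic series (a sketch with made-up numbers, not the Bitcoin data):

```r
# Synthetic daily series: a repeating weekly (period-7) pattern plus small noise
set.seed(1)
weekly = rep(c(10, 12, 11, 9, 8, 14, 13), times = 20)
x = weekly + rnorm(length(weekly), sd = 0.1)

# A seasonal difference at lag 7 removes the repeating weekly level
x_sdiff = diff(x, lag = 7)

# The differenced series is far less variable than the original
sd(x_sdiff) < sd(x)  # TRUE
```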
If we look at the STL decomposition of the target price, we can see that most of the variance cannot be explained by the seasonal and trend components: almost all the fluctuation remains in the remainder.
plot_stl_diagnostics(bitcoin, .date_var = date, .value = price)
In spite of that, the weekly seasonal component is clear, so it is a point to consider in the models.
If this decomposition is compared with that of another variable, for example volume, the differences are clear.
plot_stl_diagnostics(bitcoin, .date_var = date, .value = volume)
In this case, the remainder component does not capture as much of the variability as with price. As the seasonal pattern is also clear here, a seasonal difference should be applied.
After the correlograms and the STL decompositions, differencing some variables could be worthwhile for the future models, at least keeping these transformations as new variables.
In the next cell we apply the KPSS test to see whether differencing is required in the target.
ts(bitcoin[,"price"]) %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.437
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
Although the statistic does not exceed the critical values at the 5% level and below, it does exceed the 10% critical value, so differencing should be considered in the target.
diff(ts(bitcoin[,"price"])) %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.1545
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
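The decision rule of the KPSS test can be wrapped in a small helper that compares the statistic against the critical values reported above (a sketch; the numbers are copied from the urca output):

```r
# Critical values for the KPSS test of type "mu", taken from the summary above
kpss_crit = c("10pct" = 0.347, "5pct" = 0.463, "2.5pct" = 0.574, "1pct" = 0.739)

# Significance levels at which the null of stationarity is rejected
kpss_rejects = function(stat, crit = kpss_crit) {
  names(crit)[stat > crit]
}

kpss_rejects(0.437)   # original series: rejected only at the 10% level
kpss_rejects(0.1545)  # differenced series: not rejected at any level
```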
After differencing, the test statistic (0.1545) is well below all the critical values, so the margin is much larger. We can compare the previous ACF and PACF plots of price:
bitcoin[,"price"] %>%
plot_acf_diagnostics(date, price, .lags = 12)
And the new ones:
as_tibble(diff(ts(bitcoin[,"price"]))) %>%
plot_acf_diagnostics(date, price, .lags = 12)
The same procedure can be applied to the other variables. In this case we compare the correlogram of volume:
bitcoin[,"volume"] %>%
plot_acf_diagnostics(date, volume, .lags = 12)With the one after applying a weekly seasonal difference.
as_tibble(diff(ts(bitcoin[,"volume"])), lag = 7) %>%
plot_acf_diagnostics(date, volume, .lags = 12)We can see how the weekly dependence is smoothed.
Finally, these differences are applied and stored as new variables in a new tibble:
- A first difference of price.
- A weekly seasonal difference of volume, trans and trans_cost.
bitcoin_processed = bitcoin %>%
mutate(
price_diff = difference(price),
trans_diff = difference(trans, 7),
trans_cost_diff = difference(trans_cost, 7),
volume_diff = difference(volume, 7))
bitcoin_processed
## # A tibble: 442 × 9
## date price volume trans trans_cost price_diff trans_diff
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2021-01-01 28983. 606524119. 258080 112. NA NA
## 2 2021-01-02 29394. 491258038. 297111 107. 411. NA
## 3 2021-01-03 32195. 1393906626. 359116 100. 2802. NA
## 4 2021-01-04 33001. 975831415. 373734 104. 805. NA
## 5 2021-01-05 32035. 1432013253. 354091 99.7 -966. NA
## 6 2021-01-06 34047. 1026923832. 397384 108. 2012. NA
## 7 2021-01-07 36860. 1414154053. 401744 112. 2814. NA
## 8 2021-01-08 39486. 1836614179. 358526 118. 2626. 100446
## 9 2021-01-09 40670. 1794260807. 321389 123. 1184. 24278
## 10 2021-01-10 40241. 711474843. 331865 130. -430. -27251
## # … with 432 more rows, and 2 more variables: trans_cost_diff <dbl>,
## # volume_diff <dbl>
Another important point is whether the target can be explained using lagged values of the variables or of the target itself. With the next function, lagged columns can be added to the data.
lag_transformer_grouped <- function(data, var, lag){
data %>%
tk_augment_lags(var, .lags = lag)
}
In addition, it is important to remark that if we want to predict tomorrow's Bitcoin price, we do not know the future values of the other indicators. For this reason, a lag of 1 day is applied to each variable, in order to make the prediction more realistic.
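What tk_augment_lags does for a single column can be sketched in base R (a toy vector, not the actual data):

```r
# A 1-step lag by hand: shift values down by one, pad the start with NA
lag1 = function(x) c(NA, x[-length(x)])

volume_toy = c(100, 120, 90, 110)
lag1(volume_toy)  # NA 100 120 90
```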
The newly created variables are stored, as before, in bitcoin_processed.
bitcoin_processed = bitcoin_processed %>%
lag_transformer_grouped("trans", 1) %>%
lag_transformer_grouped("trans_cost", 1) %>%
lag_transformer_grouped("volume", 1)
bitcoin_processed
## # A tibble: 442 × 12
## date price volume trans trans_cost price_diff trans_diff
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2021-01-01 28983. 606524119. 258080 112. NA NA
## 2 2021-01-02 29394. 491258038. 297111 107. 411. NA
## 3 2021-01-03 32195. 1393906626. 359116 100. 2802. NA
## 4 2021-01-04 33001. 975831415. 373734 104. 805. NA
## 5 2021-01-05 32035. 1432013253. 354091 99.7 -966. NA
## 6 2021-01-06 34047. 1026923832. 397384 108. 2012. NA
## 7 2021-01-07 36860. 1414154053. 401744 112. 2814. NA
## 8 2021-01-08 39486. 1836614179. 358526 118. 2626. 100446
## 9 2021-01-09 40670. 1794260807. 321389 123. 1184. 24278
## 10 2021-01-10 40241. 711474843. 331865 130. -430. -27251
## # … with 432 more rows, and 5 more variables: trans_cost_diff <dbl>,
## # volume_diff <dbl>, trans_lag1 <dbl>, trans_cost_lag1 <dbl>,
## # volume_lag1 <dbl>
So far we have not looked at how the variables relate to one another. A sparse VAR model is useful to indicate how the variables affect each other.
plot_lag_matrix = function(data){
VAR.L1 <- sparseVAR(Y=scale(as.matrix(data[,-1])),
h = 1,
selection = "cv",
VARpen = "L1")
LhatL1 = lagmatrix(fit=VAR.L1, returnplot=TRUE)
}
plot_lag_matrix(bitcoin)
Given the matrix, we are only interested in the first row, since we want to predict price.
The relevant lagged effects are:
- Lag 3 of trans on price.
- Lag 2 of trans_cost on price.
- Lag 7 of price on itself.
These lagged values are stored as new columns in the processed tibble.
bitcoin_processed = bitcoin_processed %>%
lag_transformer_grouped("price", 7) %>%
lag_transformer_grouped("trans", 3) %>%
lag_transformer_grouped("trans_cost", 2)
bitcoin_processed
## # A tibble: 442 × 15
## date price volume trans trans_cost price_diff trans_diff
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2021-01-01 28983. 606524119. 258080 112. NA NA
## 2 2021-01-02 29394. 491258038. 297111 107. 411. NA
## 3 2021-01-03 32195. 1393906626. 359116 100. 2802. NA
## 4 2021-01-04 33001. 975831415. 373734 104. 805. NA
## 5 2021-01-05 32035. 1432013253. 354091 99.7 -966. NA
## 6 2021-01-06 34047. 1026923832. 397384 108. 2012. NA
## 7 2021-01-07 36860. 1414154053. 401744 112. 2814. NA
## 8 2021-01-08 39486. 1836614179. 358526 118. 2626. 100446
## 9 2021-01-09 40670. 1794260807. 321389 123. 1184. 24278
## 10 2021-01-10 40241. 711474843. 331865 130. -430. -27251
## # … with 432 more rows, and 8 more variables: trans_cost_diff <dbl>,
## # volume_diff <dbl>, trans_lag1 <dbl>, trans_cost_lag1 <dbl>,
## # volume_lag1 <dbl>, price_lag7 <dbl>, trans_lag3 <dbl>,
## # trans_cost_lag2 <dbl>
After differencing and creating lagged columns, the new variables have missing values at the start. As there is enough data, these rows are removed so that they do not cause problems in the models.
bitcoin_processed = bitcoin_processed %>%
drop_na()
bitcoin_processed
## # A tibble: 435 × 15
## date price volume trans trans_cost price_diff trans_diff
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2021-01-08 39486. 1836614179. 358526 118. 2626. 100446
## 2 2021-01-09 40670. 1794260807. 321389 123. 1184. 24278
## 3 2021-01-10 40241. 711474843. 331865 130. -430. -27251
## 4 2021-01-11 38240. 1415450813. 333958 113. -2001. -39776
## 5 2021-01-12 35545. 3039065356. 336627 113. -2695. -17464
## 6 2021-01-13 34012. 1344114135. 319167 117. -1533. -78217
## 7 2021-01-14 37393. 1209077692. 338809 121. 3381. -62935
## 8 2021-01-15 39158. 988940894. 344002 121. 1765. -14524
## 9 2021-01-16 36829. 1194986405. 308461 122. -2330. -12928
## 10 2021-01-17 36065. 685084802. 271874 117. -763. -59991
## # … with 425 more rows, and 8 more variables: trans_cost_diff <dbl>,
## # volume_diff <dbl>, trans_lag1 <dbl>, trans_cost_lag1 <dbl>,
## # volume_lag1 <dbl>, price_lag7 <dbl>, trans_lag3 <dbl>,
## # trans_cost_lag2 <dbl>
Only 7 rows were removed (from 442 to 435).
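The count matches the largest lag used: the first 7 observations cannot have a 7-day lag or a weekly difference, so they are NA and get dropped. A toy check:

```r
# A 7-step lag of a length-10 vector leaves exactly 7 leading NAs
x = 1:10
x_lag7 = c(rep(NA, 7), x[1:(length(x) - 7)])

sum(is.na(x_lag7))  # 7
```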
The next function is employed to get the forecast and the prediction intervals on one split. The main arguments are:
- sp: integer selecting the split.
- conf: the confidence level; 95% is used by default.
- back: only used in the visualization of the actual data; it indicates how many samples of the training set (actual data) are plotted.
forecast_results = function(model_table, sp=1, conf=0.95, back=30){
calibration_table = model_table %>%
modeltime_calibrate(testing(splits$splits[[sp]]))
train_idx = splits$splits[[sp]]$in_id
beg = train_idx[length(train_idx)] - back
test_idx = splits$splits[[sp]]$out_id
n_col = ncol(bitcoin_processed)
fc = calibration_table %>%
modeltime_forecast(actual_data = bitcoin_processed[beg:test_idx[length(test_idx)],
1:n_col],
new_data = bitcoin_processed[test_idx, 1:n_col],
conf_interval = conf)
out_list = list("fc"=fc, "calibration"=calibration_table)
return(out_list)
}
The next functions are used to show the forecast accuracy and to plot the predicted values.
plot_forecast_table = function(calibration_table){
calibration_table %>%
modeltime_accuracy() %>%
table_modeltime_accuracy(.interactive = FALSE)
}
plot_forecast = function(fc_results){
fc_results %>%
plot_modeltime_forecast(.interactive = FALSE) +
labs(title = "Forecasts", y = "")
}
The evaluation of the models is carried out by cross validation, using sliding windows. The next functions are used to get the corresponding splits and to plot them.
cv_split = function(data,
initial="5 months",
assess="1 months",
skip="2 months",
cumulative=FALSE){
splits = data %>%
time_series_cv(date_var = date,
initial = initial,
assess = assess,
skip = skip,
cumulative = cumulative)
return(splits)
}
plot_cv_splits = function(splits){
splits %>%
tk_time_series_cv_plan() %>%
plot_time_series_cv_plan(date, price, .interactive = FALSE)
}
splits = cv_split(bitcoin_processed)
plot_cv_splits(splits)
There are 5 splits. As the models first need to be trained on a single split, the first one is selected.
sp = 1
These functions are employed to train the models by cross validation, as well as to plot the process and the accuracy table. They are used in the next sections.
training_results = function(model_table, splits){
resample_results = model_table %>%
modeltime_fit_resamples(resamples = splits,
control = control_resamples(verbose = TRUE)
)
return(resample_results)
}
plot_training_cv = function(resample_results){
resample_results %>%
plot_modeltime_resamples(.summary_fn = mean,
.point_size = 3,
.interactive = FALSE
)
}
plot_training_table = function(resample_results){
resample_results %>%
modeltime_resample_accuracy(summary_fns = list(mean = mean)) %>%
table_modeltime_accuracy(.interactive = FALSE)
}
This section consists of simple models, where only the target is used, and dynamic regressions, in which the rest of the variables are employed.
We start with simple models using modeltime. Only the target price is employed, so they can serve as benchmarks. ARIMA and Prophet are selected.
The first implementation corresponds to an auto ARIMA.
mod_arima =
arima_reg() %>%
set_engine("auto_arima") %>%
fit(price ~ date,
training(splits$splits[[sp]]))
As explained in the preprocessing steps, first and seasonal differences of price may be worthwhile. In addition, the significant first spike in the PACF translates into \(p=1\) in the ARIMA.
Note that the seasonal MA term is included because it improves the performance in terms of BIC.
mod_arima_custom =
arima_reg(
seasonal_period = "auto",
non_seasonal_ar = 1,
non_seasonal_differences = 1,
non_seasonal_ma = 0,
seasonal_ar = 0,
seasonal_differences = 1,
seasonal_ma = 1
) %>%
set_engine("arima") %>%
fit(price ~ date,
training(splits$splits[[sp]]))
First, the default Prophet model is employed.
mod_prophet =
prophet_reg() %>%
set_engine("prophet") %>%
fit(price ~ date,
training(splits$splits[[sp]]))
Second, some parameters are set manually to see whether the performance improves.
mod_prophet_custom =
prophet_reg(
growth = "linear",
season = "additive",
seasonality_daily = "auto"
) %>%
set_engine("prophet") %>%
fit(price ~ date,
training(splits$splits[[sp]]))
Once the models are defined, they are stored in a modeltime_table.
autoMod_table = modeltime_table(
mod_arima,
mod_arima_custom,
mod_prophet,
mod_prophet_custom
)
autoMod_table
## # Modeltime Table
## # A tibble: 4 × 3
## .model_id .model .model_desc
## <int> <list> <chr>
## 1 1 <fit[+]> ARIMA(0,1,0)(2,0,0)[7]
## 2 2 <fit[+]> ARIMA(1,1,0)(0,1,1)[7]
## 3 3 <fit[+]> PROPHET
## 4 4 <fit[+]> PROPHET
The first step is obtaining the forecast data and the calibration table for all the models.
autoForecast = forecast_results(autoMod_table)
The calibration table, in which all the models are stored, can be accessed as:
autoForecast$calibration
## # Modeltime Table
## # A tibble: 4 × 5
## .model_id .model .model_desc .type .calibration_data
## <int> <list> <chr> <chr> <list>
## 1 1 <fit[+]> ARIMA(0,1,0)(2,0,0)[7] Test <tibble [31 × 4]>
## 2 2 <fit[+]> ARIMA(1,1,0)(0,1,1)[7] Test <tibble [31 × 4]>
## 3 3 <fit[+]> PROPHET Test <tibble [31 × 4]>
## 4 4 <fit[+]> PROPHET Test <tibble [31 × 4]>
The forecast values can be visualized as:
idx = 100:110
autoForecast$fc[idx,]
## # A tibble: 11 × 7
## .model_id .model_desc .key .index .value .conf_lo .conf_hi
## <int> <chr> <fct> <date> <dbl> <dbl> <dbl>
## 1 2 ARIMA(1,1,0)(0,1,1)[7] predict… 2022-02-22 42372. 36803. 47940.
## 2 2 ARIMA(1,1,0)(0,1,1)[7] predict… 2022-02-23 42266. 36698. 47835.
## 3 2 ARIMA(1,1,0)(0,1,1)[7] predict… 2022-02-24 42338. 36770. 47906.
## 4 2 ARIMA(1,1,0)(0,1,1)[7] predict… 2022-02-25 41955. 36386. 47523.
## 5 2 ARIMA(1,1,0)(0,1,1)[7] predict… 2022-02-26 41701. 36132. 47269.
## 6 2 ARIMA(1,1,0)(0,1,1)[7] predict… 2022-02-27 41844. 36276. 47413.
## 7 2 ARIMA(1,1,0)(0,1,1)[7] predict… 2022-02-28 42121. 36552. 47689.
## 8 2 ARIMA(1,1,0)(0,1,1)[7] predict… 2022-03-01 42132. 36564. 47701.
## 9 2 ARIMA(1,1,0)(0,1,1)[7] predict… 2022-03-02 42027. 36459. 47596.
## 10 2 ARIMA(1,1,0)(0,1,1)[7] predict… 2022-03-03 42099. 36530. 47667.
## 11 2 ARIMA(1,1,0)(0,1,1)[7] predict… 2022-03-04 41715. 36147. 47284.
We are interested in:
- .value: the forecast value.
- .conf_lo: the lower limit of the prediction interval.
- .conf_hi: the upper limit of the prediction interval.
The test accuracy is shown with the next function.
plot_forecast_table(autoForecast$calibration)
| Accuracy Table | ||||||||
|---|---|---|---|---|---|---|---|---|
| .model_id | .model_desc | .type | mae | mape | mase | smape | rmse | rsq |
| 1 | ARIMA(0,1,0)(2,0,0)[7] | Test | 3348.60 | 8.58 | 2.55 | 8.16 | 3700.40 | 0.36 |
| 2 | ARIMA(1,1,0)(0,1,1)[7] | Test | 2499.28 | 6.39 | 1.90 | 6.16 | 2818.08 | 0.05 |
| 3 | PROPHET | Test | 2995.19 | 7.27 | 2.28 | 7.66 | 3632.76 | 0.02 |
| 4 | PROPHET | Test | 2995.19 | 7.27 | 2.28 | 7.66 | 3632.76 | 0.02 |
The ARIMA model whose orders were set manually obtains the best score. Finally, we can plot the forecast values for the selected split (sp=1), as well as the prediction intervals.
plot_forecast(autoForecast$fc)
Given the modeltime_table, the training function can be called. The models are trained and evaluated using all the splits.
autoMod_resample_results = training_results(autoMod_table, splits)
## ── Fitting Resamples ────────────────────────────────────────────
##
## 13.933 sec elapsed
We can easily plot the accuracy results in every split.
plot_training_cv(autoMod_resample_results)
As shown, in general terms the ARIMA models perform better than Prophet in every split. The mean accuracy is displayed with the next function.
plot_training_table(autoMod_resample_results)
| Accuracy Table | |||||||||
|---|---|---|---|---|---|---|---|---|---|
| .model_id | .model_desc | .type | n | mae_mean | mape_mean | mase_mean | smape_mean | rmse_mean | rsq_mean |
| 1 | ARIMA(0,1,0)(2,0,0)[7] | Resamples | 5 | 3550.09 | 8.06 | 3.16 | 8.03 | 4040.96 | NA |
| 2 | ARIMA(1,1,0)(0,1,1)[7] | Resamples | 5 | 3553.95 | 7.16 | 2.86 | 6.81 | 4145.97 | 0.21 |
| 3 | PROPHET | Resamples | 5 | 5129.00 | 12.17 | 4.92 | 13.04 | 5886.67 | 0.25 |
| 4 | PROPHET | Resamples | 5 | 5129.00 | 12.17 | 4.92 | 13.04 | 5886.67 | 0.25 |
As expected, the ARIMA models beat Prophet. However, note that here the automatic ARIMA and the manual one perform similarly, whereas in general the manual specification gives better results, and in the forecast there was a significant difference.
Now the variables are considered in a dynamic regression with ARIMA.
The first model employs only the original variables. Note that this is not a very realistic scenario since, as mentioned before, the future values of these predictors are not known at the time we want to predict the future target.
dynamic_arima1 =
arima_reg(
seasonal_period = "auto",
non_seasonal_ar = 1,
non_seasonal_differences = 1,
non_seasonal_ma = 0,
seasonal_ar = 0,
seasonal_differences = 1,
seasonal_ma = 1
) %>%
set_engine("arima") %>%
fit(price ~ date + trans + trans_cost + volume,
training(splits$splits[[sp]]))
In the second model the same idea is applied, employing the 1-day lagged values of the variables.
dynamic_arima2 =
arima_reg(
seasonal_period = "auto",
non_seasonal_ar = 1,
non_seasonal_differences = 1,
non_seasonal_ma = 0,
seasonal_ar = 0,
seasonal_differences = 1,
seasonal_ma = 1
) %>%
set_engine("arima") %>%
fit(price ~ date + trans_lag1 + trans_cost_lag1 + volume_lag1,
training(splits$splits[[sp]]))
In this case we use the lags suggested by the VAR model. Note that the lag corresponding to price is not used, since price is already modeled by the ARIMA.
dynamic_arima3 =
arima_reg(
seasonal_period = "auto",
non_seasonal_ar = 1,
non_seasonal_differences = 1,
non_seasonal_ma = 0,
seasonal_ar = 0,
seasonal_differences = 1,
seasonal_ma = 1
) %>%
set_engine("arima") %>%
fit(price ~ date + trans_lag3 + trans_cost_lag2,
training(splits$splits[[sp]]))
This model employs the differenced variables. As before, the one corresponding to price is not used, since that is the role of the ARIMA.
dynamic_arima4 =
arima_reg(
seasonal_period = "auto",
non_seasonal_ar = 1,
non_seasonal_differences = 1,
non_seasonal_ma = 0,
seasonal_ar = 0,
seasonal_differences = 1,
seasonal_ma = 1
) %>%
set_engine("arima") %>%
fit(price ~ date + trans_diff + trans_cost_diff + volume_diff,
training(splits$splits[[sp]]))
This model employs a mixed combination of variables.
dynamic_arima5 =
arima_reg(
seasonal_period = "auto",
non_seasonal_ar = 1,
non_seasonal_differences = 1,
non_seasonal_ma = 0,
seasonal_ar = 0,
seasonal_differences = 1,
seasonal_ma = 1
) %>%
set_engine("arima") %>%
fit(price ~ date + volume_lag1 + trans_cost_lag2 + trans_diff,
training(splits$splits[[sp]]))
The last one considers all the available predictors.
dynamic_arima6 =
arima_reg(
seasonal_period = "auto",
non_seasonal_ar = 1,
non_seasonal_differences = 1,
non_seasonal_ma = 0,
seasonal_ar = 0,
seasonal_differences = 1,
seasonal_ma = 1
) %>%
set_engine("arima") %>%
fit(price ~ date + trans_lag1 + trans_cost_lag1 + volume_lag1 +
trans_lag3 + trans_cost_lag2 +
trans_diff + trans_cost_diff + volume_diff,
training(splits$splits[[sp]]))
The evaluation process is the same as before. The first step is saving all the models in a modeltime_table.
dynMod_table = modeltime_table(
dynamic_arima1,
dynamic_arima2,
dynamic_arima3,
dynamic_arima4,
dynamic_arima5,
dynamic_arima6
)
dynMod_table
## # Modeltime Table
## # A tibble: 6 × 3
## .model_id .model .model_desc
## <int> <list> <chr>
## 1 1 <fit[+]> REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS
## 2 2 <fit[+]> REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS
## 3 3 <fit[+]> REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS
## 4 4 <fit[+]> REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS
## 5 5 <fit[+]> REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS
## 6 6 <fit[+]> REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS
As before, we can get the accuracy table with the defined functions.
dynForecast = forecast_results(dynMod_table)
plot_forecast_table(dynForecast$calibration)
| Accuracy Table | ||||||||
|---|---|---|---|---|---|---|---|---|
| .model_id | .model_desc | .type | mae | mape | mase | smape | rmse | rsq |
| 1 | REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS | Test | 2095.07 | 5.30 | 1.60 | 5.19 | 2307.57 | 0.17 |
| 2 | REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS | Test | 2172.84 | 5.54 | 1.65 | 5.38 | 2448.70 | 0.24 |
| 3 | REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS | Test | 2526.46 | 6.46 | 1.92 | 6.23 | 2849.38 | 0.03 |
| 4 | REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS | Test | 2293.36 | 5.85 | 1.75 | 5.67 | 2576.88 | 0.18 |
| 5 | REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS | Test | 2482.06 | 6.35 | 1.89 | 6.12 | 2800.30 | 0.14 |
| 6 | REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS | Test | 1544.31 | 3.88 | 1.18 | 3.84 | 1796.65 | 0.43 |
The results are very similar, except for the last model, which significantly outperforms the rest. However, this one uses all the variables and may overfit.
plot_forecast(dynForecast$fc)
Looking at the forecast series, there are no significant differences between the models.
Now the models are evaluated using all the splits.
dynMod_resample_results = training_results(dynMod_table, splits)
## ── Fitting Resamples ────────────────────────────────────────────
##
## 27.733 sec elapsed
plot_training_cv(dynMod_resample_results)
All the models perform similarly across the splits. Let us look at the average accuracy results.
plot_training_table(dynMod_resample_results)
| Accuracy Table | |||||||||
|---|---|---|---|---|---|---|---|---|---|
| .model_id | .model_desc | .type | n | mae_mean | mape_mean | mase_mean | smape_mean | rmse_mean | rsq_mean |
| 1 | REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS | Resamples | 5 | 2659.22 | 5.40 | 2.20 | 5.26 | 3135.23 | 0.19 |
| 2 | REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS | Resamples | 5 | 3225.13 | 6.62 | 2.64 | 6.34 | 3825.89 | 0.22 |
| 3 | REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS | Resamples | 5 | 3434.56 | 6.92 | 2.77 | 6.58 | 4021.33 | 0.19 |
| 4 | REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS | Resamples | 5 | 3836.31 | 7.73 | 3.10 | 7.34 | 4482.92 | 0.18 |
| 5 | REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS | Resamples | 5 | 3182.75 | 6.41 | 2.60 | 6.16 | 3760.46 | 0.15 |
| 6 | REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS | Resamples | 5 | 3376.27 | 7.06 | 2.84 | 6.74 | 3970.77 | 0.19 |
As expected, the best model is the one in which we provide the current values of the variables. Nevertheless, as mentioned before, that is not a realistic scenario.
However, the ARIMA with the one-day lagged variables also achieves a good performance, as does the one that uses a mixture of lagged and differenced variables.
In this part we keep the statistical models that gave the best forecasting results, since the forecast concerns the most recent data, which interests us the most. The selected models are the custom ARIMA and the dynamic regression with all the predictors:
staMod_table = modeltime_table(
mod_arima_custom,
dynamic_arima6
)
staMod_table
## # Modeltime Table
## # A tibble: 2 × 3
## .model_id .model .model_desc
## <int> <list> <chr>
## 1 1 <fit[+]> ARIMA(1,1,0)(0,1,1)[7]
## 2 2 <fit[+]> REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS
staForecast = forecast_results(staMod_table)
plot_forecast_table(staForecast$calibration)
| Accuracy Table | ||||||||
|---|---|---|---|---|---|---|---|---|
| .model_id | .model_desc | .type | mae | mape | mase | smape | rmse | rsq |
| 1 | ARIMA(1,1,0)(0,1,1)[7] | Test | 2499.28 | 6.39 | 1.90 | 6.16 | 2818.08 | 0.05 |
| 2 | REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS | Test | 1544.31 | 3.88 | 1.18 | 3.84 | 1796.65 | 0.43 |
As we have seen, the dynamic regression outperforms the simple ARIMA.
plot_forecast(staForecast$fc)
However, the forecasts and prediction intervals are similar.
staMod_resample_results = training_results(staMod_table, splits)
## ── Fitting Resamples ────────────────────────────────────────────
##
## 7.574 sec elapsed
plot_training_cv(staMod_resample_results)
The performance of the model with regressors is better than that of the simple ARIMA.
plot_training_table(staMod_resample_results)
| Accuracy Table | |||||||||
|---|---|---|---|---|---|---|---|---|---|
| .model_id | .model_desc | .type | n | mae_mean | mape_mean | mase_mean | smape_mean | rmse_mean | rsq_mean |
| 1 | ARIMA(1,1,0)(0,1,1)[7] | Resamples | 5 | 3553.95 | 7.16 | 2.86 | 6.81 | 4145.97 | 0.21 |
| 2 | REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS | Resamples | 5 | 3376.27 | 7.06 | 2.84 | 6.74 | 3970.77 | 0.19 |
As we can see, adding the variables in a dynamic regression improves on the result obtained with a simple ARIMA model.
The goal is to predict the target using all the available features. With machine learning, all the time series, both the target and the predictors, are used to predict the future value of the response.
Before applying machine learning models, some steps have to be carried out. In this case a recipe is created, in which:
- The time series signature of date is generated, and the resulting calendar columns are then removed.
- volume, trans and trans_cost are removed, in order to be more realistic with respect to the prediction.
- The remaining numeric predictors are normalized.
recipe_spec = recipe(price ~ .,
training(splits$splits[[sp]])) %>%
step_timeseries_signature(date) %>%
step_rm(contains("date_")) %>%
step_rm("volume", "trans", "trans_cost") %>%
step_normalize(price_diff, trans_diff, trans_cost_diff, volume_diff,
trans_lag1, trans_cost_lag1, volume_lag1,
price_lag7, trans_lag3, trans_cost_lag2)
recipe_spec %>% prep() %>% juice() %>% head()
## # A tibble: 6 × 12
## date price_diff trans_diff trans_cost_diff volume_diff trans_lag1
## <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2021-09-18 -0.284 0.504 0.540 -0.463 0.270
## 2 2021-09-19 0.601 0.428 0.825 0.154 -1.13
## 3 2021-09-20 -0.569 0.780 -0.170 -0.00766 -2.01
## 4 2021-09-21 -2.51 -1.29 -1.62 0.828 0.244
## 5 2021-09-22 -1.31 0.344 -0.984 1.87 -0.585
## 6 2021-09-23 1.76 -0.137 0.638 0.457 0.406
## # … with 6 more variables: trans_cost_lag1 <dbl>, volume_lag1 <dbl>,
## # price_lag7 <dbl>, trans_lag3 <dbl>, trans_cost_lag2 <dbl>, price <dbl>
Different models are going to be fitted and evaluated to predict the bitcoin value. The options are elastic net, random forest, XGBoost and the neural network NNETAR.
To find good hyper-parameters, different options are tried manually. Note that an automatic hyper-parameter search using cross validation, which would be the optimal approach, could not be implemented due to several errors during the process. One cause could be the use of several variables rather than only the target.
The considered hyper-parameters are:
- penalty: a non-negative number representing the total amount of regularization.
- mixture: a number between zero and one giving the proportion of L1 (lasso) regularization in the model.
The idea is to penalize models that use many variables. The first one is an even blend of lasso and ridge regression.
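For reference, glmnet's penalty term combines the two norms as penalty * ((1 - mixture)/2 * ||beta||^2 + mixture * |beta|); a small sketch of this objective term (an illustration, not glmnet's internal code):

```r
# Elastic-net penalty term: penalty is lambda, mixture is alpha in glmnet
enet_penalty = function(beta, penalty, mixture) {
  penalty * ((1 - mixture) / 2 * sum(beta^2) + mixture * sum(abs(beta)))
}

beta = c(2, -1, 0)
enet_penalty(beta, penalty = 0.1, mixture = 1)  # pure lasso: 0.1 * 3 = 0.3
enet_penalty(beta, penalty = 0.1, mixture = 0)  # pure ridge: 0.1 * 5 / 2 = 0.25
```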
mod_glmnet1 =
linear_reg(
penalty = 0.1,
mixture = 0.5
) %>%
set_engine("glmnet")
wflw_glmnet1 = workflow() %>%
add_model(mod_glmnet1) %>%
add_recipe(recipe_spec %>% step_rm(date)) %>%
fit(training(splits$splits[[sp]]))
The second one has a larger mixture value, so the model is closer to a lasso regression.
mod_glmnet2 =
linear_reg(
penalty = 0.1,
mixture = 0.9
) %>%
set_engine("glmnet")
wflw_glmnet2 = workflow() %>%
add_model(mod_glmnet2) %>%
add_recipe(recipe_spec %>% step_rm(date)) %>%
fit(training(splits$splits[[sp]]))
In contrast, the third one is close to a ridge regression.
mod_glmnet3 =
linear_reg(
penalty = 0.1,
mixture = 0.1
) %>%
set_engine("glmnet")
wflw_glmnet3 = workflow() %>%
add_model(mod_glmnet3) %>%
add_recipe(recipe_spec %>% step_rm(date)) %>%
fit(training(splits$splits[[sp]]))
As with the statistical models, they are stored in a modeltime_table.
glmnetMod_table = modeltime_table(
wflw_glmnet1,
wflw_glmnet2,
wflw_glmnet3
)
glmnetMod_table
## # Modeltime Table
## # A tibble: 3 × 3
## .model_id .model .model_desc
## <int> <list> <chr>
## 1 1 <workflow> GLMNET
## 2 2 <workflow> GLMNET
## 3 3 <workflow> GLMNET
glmnetForecast = forecast_results(glmnetMod_table)
plot_forecast_table(glmnetForecast$calibration)
| Accuracy Table | ||||||||
|---|---|---|---|---|---|---|---|---|
| .model_id | .model_desc | .type | mae | mape | mase | smape | rmse | rsq |
| 1 | GLMNET | Test | 1544.49 | 3.86 | 1.18 | 3.87 | 1840.15 | 0.67 |
| 2 | GLMNET | Test | 1545.39 | 3.86 | 1.18 | 3.87 | 1841.81 | 0.67 |
| 3 | GLMNET | Test | 1537.22 | 3.84 | 1.17 | 3.85 | 1833.10 | 0.68 |
The performance is almost the same in the three models.
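One way to check why the three fits barely differ is to inspect the coefficients each model retains. A minimal sketch, assuming the wflw_glmnet workflows fitted above (not a required step of the analysis):

```r
# Sketch: extract the glmnet coefficients from each fitted workflow to
# compare which predictors survive the different mixture settings.
library(tidymodels)

coefs = list(mixture_0.5 = wflw_glmnet1,
             mixture_0.9 = wflw_glmnet2,
             mixture_0.1 = wflw_glmnet3) %>%
  purrr::map_dfr(~ .x %>%
                   extract_fit_parsnip() %>%
                   broom::tidy(),   # term and estimate at the fitted penalty
                 .id = "model")

# Terms shrunk exactly to zero by the lasso component:
coefs %>% dplyr::filter(estimate == 0)
```

If the three models keep essentially the same non-zero terms, near-identical forecasts are expected.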
plot_forecast(glmnetForecast$fc)
As expected, the predictions are similar.
The evaluation of the models is carried out as before. The goal is finding the best model according to the selected hyper-parameters.
glmnetMod_resample_results = training_results(glmnetMod_table, splits)
## ── Fitting Resamples ────────────────────────────────────────────
##
## 11.179 sec elapsed
plot_training_cv(glmnetMod_resample_results)
As in the forecasting, the resample performance is almost equal.
plot_training_table(glmnetMod_resample_results)
| Accuracy Table | |||||||||
|---|---|---|---|---|---|---|---|---|---|
| .model_id | .model_desc | .type | n | mae_mean | mape_mean | mase_mean | smape_mean | rmse_mean | rsq_mean |
| 1 | GLMNET | Resamples | 5 | 2118.08 | 4.93 | 2.02 | 4.93 | 2473.77 | 0.37 |
| 2 | GLMNET | Resamples | 5 | 2117.04 | 4.92 | 2.01 | 4.93 | 2472.11 | 0.37 |
| 3 | GLMNET | Resamples | 5 | 2120.50 | 4.93 | 2.02 | 4.93 | 2477.99 | 0.37 |
As expected, the three models obtain very close results.
In this case, the selected hyper-parameters are:
- mtry: an integer for the number of predictors that will be randomly sampled at each split when creating the tree models.
- trees: an integer for the number of trees contained in the ensemble.
- min_n: an integer for the minimum number of data points in a node that are required for the node to be split further.

In order to get reproducible results, the seed is fixed, since this model introduces randomness.
We start with a low value for the number of trees.
set.seed(123)
mod_rf1 =
rand_forest(
mtry = 5,
trees = 50,
min_n = 5
) %>%
set_engine("randomForest")
wflw_rf1 = workflow() %>%
add_model(mod_rf1) %>%
add_recipe(recipe_spec %>% step_rm(date)) %>%
fit(training(splits$splits[[sp]]))
The number of trees is increased.
set.seed(123)
mod_rf2 =
rand_forest(
mtry = 5,
trees = 100,
min_n = 5
) %>%
set_engine("randomForest")
wflw_rf2 = workflow() %>%
add_model(mod_rf2) %>%
add_recipe(recipe_spec %>% step_rm(date)) %>%
fit(training(splits$splits[[sp]]))
The trees parameter is increased again.
set.seed(123)
mod_rf3 =
rand_forest(
mtry = 5,
trees = 200,
min_n = 5
) %>%
set_engine("randomForest")
wflw_rf3 = workflow() %>%
add_model(mod_rf3) %>%
add_recipe(recipe_spec %>% step_rm(date)) %>%
fit(training(splits$splits[[sp]]))
The models are stored in the model table.
rfMod_table = modeltime_table(
wflw_rf1,
wflw_rf2,
wflw_rf3
)
rfMod_table
## # Modeltime Table
## # A tibble: 3 × 3
## .model_id .model .model_desc
## <int> <list> <chr>
## 1 1 <workflow> RANDOMFOREST
## 2 2 <workflow> RANDOMFOREST
## 3 3 <workflow> RANDOMFOREST
rfForecast = forecast_results(rfMod_table)
plot_forecast_table(rfForecast$calibration)
| Accuracy Table | ||||||||
|---|---|---|---|---|---|---|---|---|
| .model_id | .model_desc | .type | mae | mape | mase | smape | rmse | rsq |
| 1 | RANDOMFOREST | Test | 2354.59 | 5.99 | 1.79 | 5.74 | 2854.09 | 0.36 |
| 2 | RANDOMFOREST | Test | 2241.57 | 5.73 | 1.71 | 5.49 | 2798.66 | 0.36 |
| 3 | RANDOMFOREST | Test | 2294.46 | 5.86 | 1.75 | 5.61 | 2845.40 | 0.36 |
Again, the models behave similarly.
plot_forecast(rfForecast$fc)
And the predictions are very close.
The evaluation procedure is carried out as usual, training on all the splits.
rfMod_resample_results = training_results(rfMod_table, splits)
## ── Fitting Resamples ────────────────────────────────────────────
##
## 11.55 sec elapsed
plot_training_cv(rfMod_resample_results)
The models perform very similarly.
plot_training_table(rfMod_resample_results)
| Accuracy Table | |||||||||
|---|---|---|---|---|---|---|---|---|---|
| .model_id | .model_desc | .type | n | mae_mean | mape_mean | mase_mean | smape_mean | rmse_mean | rsq_mean |
| 1 | RANDOMFOREST | Resamples | 5 | 2797.47 | 6.29 | 2.49 | 6.19 | 3317.35 | 0.24 |
| 2 | RANDOMFOREST | Resamples | 5 | 2883.14 | 6.58 | 2.60 | 6.43 | 3441.51 | 0.23 |
| 3 | RANDOMFOREST | Resamples | 5 | 2730.33 | 6.25 | 2.46 | 6.11 | 3319.51 | 0.21 |
As can be seen, the number of trees is not very relevant, since changing it does not lead to a significant change in performance.
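This can be checked directly: the randomForest engine records the out-of-bag (OOB) error as trees are added, so plotting the fitted engine object shows where the error curve flattens. A sketch, assuming the wflw_rf3 workflow fitted above:

```r
# Sketch: out-of-bag error as a function of the number of trees.
# If the curve is flat well before 200 trees, adding more trees cannot help.
rf_fit = wflw_rf3 %>%
  workflows::extract_fit_engine()   # the underlying randomForest object

plot(rf_fit, main = "OOB error vs. number of trees")
```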
The employed hyper-parameters are:
- trees: as before, an integer for the number of trees contained in the ensemble.
- min_n: as before, an integer for the minimum number of data points in a node that are required for the node to be split further.
- tree_depth: an integer for the maximum depth of the tree.

mod_xgboost1 =
boost_tree(
trees = 50,
min_n = 20,
tree_depth = 6
) %>%
set_engine("xgboost")
wflw_xgboost1 = workflow() %>%
add_model(mod_xgboost1) %>%
add_recipe(recipe_spec %>% step_rm(date)) %>%
fit(training(splits$splits[[sp]]))
The min_n is decreased while the tree_depth increases.
mod_xgboost2 =
boost_tree(
trees = 50,
min_n = 10,
tree_depth = 7
) %>%
set_engine("xgboost")
wflw_xgboost2 = workflow() %>%
add_model(mod_xgboost2) %>%
add_recipe(recipe_spec %>% step_rm(date)) %>%
fit(training(splits$splits[[sp]]))
The same step is applied again.
mod_xgboost3 =
boost_tree(
trees = 50,
min_n = 5,
tree_depth = 8
) %>%
set_engine("xgboost")
wflw_xgboost3 = workflow() %>%
add_model(mod_xgboost3) %>%
add_recipe(recipe_spec %>% step_rm(date)) %>%
fit(training(splits$splits[[sp]]))
The models are stored in the table.
xgMod_table = modeltime_table(
wflw_xgboost1,
wflw_xgboost2,
wflw_xgboost3
)
xgMod_table
## # Modeltime Table
## # A tibble: 3 × 3
## .model_id .model .model_desc
## <int> <list> <chr>
## 1 1 <workflow> XGBOOST
## 2 2 <workflow> XGBOOST
## 3 3 <workflow> XGBOOST
xgForecast = forecast_results(xgMod_table)
plot_forecast_table(xgForecast$calibration)
| Accuracy Table | ||||||||
|---|---|---|---|---|---|---|---|---|
| .model_id | .model_desc | .type | mae | mape | mase | smape | rmse | rsq |
| 1 | XGBOOST | Test | 1561.01 | 3.88 | 1.19 | 3.83 | 1877.63 | 0.66 |
| 2 | XGBOOST | Test | 1990.02 | 5.02 | 1.52 | 4.85 | 2553.93 | 0.49 |
| 3 | XGBOOST | Test | 1784.99 | 4.55 | 1.36 | 4.42 | 2217.33 | 0.37 |
The first model outperforms the other two.
plot_forecast(xgForecast$fc)
In the forecast plot we can see how the red model is closer to the actual data.
Training and evaluation process.
xgMod_resample_results = training_results(xgMod_table, splits)
## ── Fitting Resamples ────────────────────────────────────────────
##
## 11.502 sec elapsed
plot_training_cv(xgMod_resample_results)
The third model outperforms the other two in most of the splits and metrics.
plot_training_table(xgMod_resample_results)
| Accuracy Table | |||||||||
|---|---|---|---|---|---|---|---|---|---|
| .model_id | .model_desc | .type | n | mae_mean | mape_mean | mase_mean | smape_mean | rmse_mean | rsq_mean |
| 1 | XGBOOST | Resamples | 5 | 2878.20 | 6.14 | 2.52 | 6.18 | 3447.91 | 0.25 |
| 2 | XGBOOST | Resamples | 5 | 2818.00 | 6.26 | 2.52 | 6.23 | 3467.26 | 0.15 |
| 3 | XGBOOST | Resamples | 5 | 2453.09 | 5.62 | 2.27 | 5.54 | 3102.39 | 0.16 |
As expected, the best result is obtained by the third model. In this case min_n and tree_depth have a clear impact on the performance.
Note the difference between the initial forecast and the subsequent re-training: now the best model is the third one and the worst is the first, just the other way round.
A neural network is the most complex model employed in this work. The selected hyper-parameters are:
- non_seasonal_ar: the order of the non-seasonal auto-regressive (AR) terms. In this case we selected the same value as in the ARIMA models.
- seasonal_ar: the order of the seasonal auto-regressive (SAR) terms. The same value as in the ARIMA models.
- hidden_units: an integer for the number of units in the hidden layer.
- penalty: a non-negative numeric value for the amount of weight decay.
- epochs: an integer for the number of training iterations.

Note that the initialization of the weights in the neural network is random, so the performance changes at every execution. That is why the seed is fixed again here.
set.seed(123)
mod_nnetar1 =
nnetar_reg(
non_seasonal_ar = 1,
seasonal_ar = 0,
hidden_units = 10,
penalty = 0.1,
epochs = 10
) %>%
set_engine("nnetar")
wflw_nnetar1 = workflow() %>%
add_model(mod_nnetar1) %>%
add_recipe(recipe_spec) %>%
fit(training(splits$splits[[sp]]))
The number of hidden units is increased. In order to avoid overfitting, the penalty is increased and the number of epochs reduced.
set.seed(123)
mod_nnetar2 =
nnetar_reg(
seasonal_period = "auto",
non_seasonal_ar = 1,
seasonal_ar = 0,
hidden_units = 20,
penalty = 0.2,
epochs = 5
) %>%
set_engine("nnetar")
wflw_nnetar2 = workflow() %>%
add_model(mod_nnetar2) %>%
add_recipe(recipe_spec) %>%
fit(training(splits$splits[[sp]]))
The number of epochs is reduced again.
set.seed(123)
mod_nnetar3 =
nnetar_reg(
seasonal_period = "auto",
non_seasonal_ar = 1,
seasonal_ar = 0,
hidden_units = 25,
penalty = 0.2,
epochs = 3
) %>%
set_engine("nnetar")
wflw_nnetar3 = workflow() %>%
add_model(mod_nnetar3) %>%
add_recipe(recipe_spec) %>%
fit(training(splits$splits[[sp]]))
The three networks are stored.
set.seed(123)
nnMod_table = modeltime_table(
wflw_nnetar1,
wflw_nnetar2,
wflw_nnetar3
)
nnMod_table
## # Modeltime Table
## # A tibble: 3 × 3
## .model_id .model .model_desc
## <int> <list> <chr>
## 1 1 <workflow> NNAR(1,10)
## 2 2 <workflow> NNAR(1,20)
## 3 3 <workflow> NNAR(1,25)
nnForecast = forecast_results(nnMod_table)
plot_forecast_table(nnForecast$calibration)
| Accuracy Table | ||||||||
|---|---|---|---|---|---|---|---|---|
| .model_id | .model_desc | .type | mae | mape | mase | smape | rmse | rsq |
| 1 | NNAR(1,10) | Test | 1025.39 | 2.59 | 0.78 | 2.56 | 1177.56 | 0.80 |
| 2 | NNAR(1,20) | Test | 1182.16 | 2.94 | 0.90 | 2.94 | 1385.34 | 0.81 |
| 3 | NNAR(1,25) | Test | 1523.76 | 3.76 | 1.16 | 3.86 | 1865.49 | 0.68 |
The first network is the most accurate. Note that it is also the simplest one in terms of parameters.
plot_forecast(nnForecast$fc)
As expected, the red forecast is closer to the actual data.
Perform cross validation.
nnMod_resample_results = training_results(nnMod_table, splits)
## ── Fitting Resamples ────────────────────────────────────────────
##
## 15.861 sec elapsed
plot_training_cv(nnMod_resample_results)
The first model outperforms the other two in most of the splits and metrics.
plot_training_table(nnMod_resample_results)
| Accuracy Table | |||||||||
|---|---|---|---|---|---|---|---|---|---|
| .model_id | .model_desc | .type | n | mae_mean | mape_mean | mase_mean | smape_mean | rmse_mean | rsq_mean |
| 1 | NNAR(1,10) | Resamples | 5 | 2583.34 | 5.43 | 2.18 | 5.48 | 2823.77 | 0.66 |
| 2 | NNAR(1,20) | Resamples | 5 | 3342.63 | 7.06 | 2.91 | 7.07 | 3643.16 | 0.44 |
| 3 | NNAR(1,25) | Resamples | 5 | 3188.16 | 7.01 | 2.74 | 7.31 | 3654.71 | 0.40 |
The simplest neural network obtains the best performance.
It is important to remark that selecting the hyper-parameters of a neural network is a complex and time-consuming process. The number of parameters grows quickly with larger hyper-parameter values, increasing the risk of overfitting, so it is good practice to avoid building an overly complex network.
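To make the size argument concrete: an NNAR(p, k) model is a single-hidden-layer network with p lagged inputs and k hidden units, so with no exogenous regressors it has k(p + 1) + (k + 1) weights. A small illustrative helper (the real counts here are larger, because the recipe adds extra predictors as inputs):

```r
# Illustrative weight count for a single-hidden-layer NNAR(p, k) network:
# hidden-layer weights and biases plus the linear output weights and bias.
nnar_weights = function(p, k, n_xreg = 0) {
  inputs = p + n_xreg
  k * (inputs + 1) + (k + 1)
}

nnar_weights(1, 10)   # NNAR(1,10) -> 31 weights
nnar_weights(1, 20)   # NNAR(1,20) -> 61 weights
nnar_weights(1, 25)   # NNAR(1,25) -> 76 weights
```

The largest network has more than double the weights of the smallest, trained on the same amount of data, which is consistent with the simplest model generalizing best.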
Finally the best machine learning models are selected to predict the value of the target.
mlMod_table = modeltime_table(
wflw_glmnet3,
wflw_rf2,
wflw_xgboost1,
wflw_nnetar1
)
mlMod_table
## # Modeltime Table
## # A tibble: 4 × 3
## .model_id .model .model_desc
## <int> <list> <chr>
## 1 1 <workflow> GLMNET
## 2 2 <workflow> RANDOMFOREST
## 3 3 <workflow> XGBOOST
## 4 4 <workflow> NNAR(1,10)
mlForecast = forecast_results(mlMod_table)
plot_forecast_table(mlForecast$calibration)
| Accuracy Table | ||||||||
|---|---|---|---|---|---|---|---|---|
| .model_id | .model_desc | .type | mae | mape | mase | smape | rmse | rsq |
| 1 | GLMNET | Test | 1537.22 | 3.84 | 1.17 | 3.85 | 1833.10 | 0.68 |
| 2 | RANDOMFOREST | Test | 2241.57 | 5.73 | 1.71 | 5.49 | 2798.66 | 0.36 |
| 3 | XGBOOST | Test | 1561.01 | 3.88 | 1.19 | 3.83 | 1877.63 | 0.66 |
| 4 | NNAR(1,10) | Test | 1025.39 | 2.59 | 0.78 | 2.56 | 1177.56 | 0.80 |
The best model is the neural network, followed by the elastic net.
plot_forecast(mlForecast$fc)
As can be seen, the NNETAR model fits the data very well, as does the elastic net. Nevertheless, all models make good predictions with narrow prediction intervals.
The cross-validation process is repeated.
ml_resample_results = training_results(mlMod_table, splits)
## ── Fitting Resamples ────────────────────────────────────────────
##
## 16.568 sec elapsed
plot_training_cv(ml_resample_results)
The elastic net is the best in most of the splits and metrics. Note that the behavior of the neural network is very irregular: in some splits it is very accurate, while in others the error is large.
plot_training_table(ml_resample_results)
| Accuracy Table | |||||||||
|---|---|---|---|---|---|---|---|---|---|
| .model_id | .model_desc | .type | n | mae_mean | mape_mean | mase_mean | smape_mean | rmse_mean | rsq_mean |
| 1 | GLMNET | Resamples | 5 | 2120.50 | 4.93 | 2.02 | 4.93 | 2477.99 | 0.37 |
| 2 | RANDOMFOREST | Resamples | 5 | 2737.18 | 6.30 | 2.48 | 6.14 | 3307.81 | 0.22 |
| 3 | XGBOOST | Resamples | 5 | 2878.20 | 6.14 | 2.52 | 6.18 | 3447.91 | 0.25 |
| 4 | NNAR(1,10) | Resamples | 5 | 2778.90 | 5.71 | 2.33 | 5.77 | 2965.17 | 0.67 |
As expected, the best result is obtained by the elastic net, whereas the most complex model, the NNETAR network, sees its score decrease when averaging over all the splits.
The machine learning models obtained better forecasting results and cross-validation performance than the statistical models.
As a last point, it is interesting to see if the performance obtained by the NNETAR model can be improved by means of ensembles.
If we look again into the forecast table, the second and third best models are the elastic net and the XGBoost respectively.
plot_forecast_table(mlForecast$calibration)
| Accuracy Table | ||||||||
|---|---|---|---|---|---|---|---|---|
| .model_id | .model_desc | .type | mae | mape | mase | smape | rmse | rsq |
| 1 | GLMNET | Test | 1537.22 | 3.84 | 1.17 | 3.85 | 1833.10 | 0.68 |
| 2 | RANDOMFOREST | Test | 2241.57 | 5.73 | 1.71 | 5.49 | 2798.66 | 0.36 |
| 3 | XGBOOST | Test | 1561.01 | 3.88 | 1.19 | 3.83 | 1877.63 | 0.66 |
| 4 | NNAR(1,10) | Test | 1025.39 | 2.59 | 0.78 | 2.56 | 1177.56 | 0.80 |
The first ensemble is a combination of these two.
idx1 = c(1, 3) # selection of models
mod_ensemble1 = mlForecast$calibration[idx1,] %>%
ensemble_average(type = "mean")
mod_ensemble1
## ── Modeltime Ensemble ───────────────────────────────────────────
## Ensemble of 2 Models (MEAN)
##
## # Modeltime Table
## # A tibble: 2 × 5
## .model_id .model .model_desc .type .calibration_data
## <int> <list> <chr> <chr> <list>
## 1 1 <workflow> GLMNET Test <tibble [31 × 4]>
## 2 3 <workflow> XGBOOST Test <tibble [31 × 4]>
The forecasting process is the same as before.
forecast_ens1 = forecast_results(mod_ensemble1)
plot_forecast_table(forecast_ens1$calibration)
| Accuracy Table | ||||||||
|---|---|---|---|---|---|---|---|---|
| .model_id | .model_desc | .type | mae | mape | mase | smape | rmse | rsq |
| 1 | ENSEMBLE (MEAN): 2 MODELS | Test | 1266.32 | 3.18 | 0.96 | 3.17 | 1480.73 | 0.77 |
However, the ensemble does not improve on the NNETAR performance.
plot_forecast(forecast_ens1$fc)
Nevertheless, the fit seems appropriate for the data, and this ensemble could be used if a faster and lighter model is preferred.
Now the ensemble is the combination of the elastic net model and the neural network.
set.seed(123)
idx2 = c(1, 4) # selection of models
mod_ensemble2 = mlForecast$calibration[idx2,] %>%
ensemble_average(type = "mean")
mod_ensemble2
## ── Modeltime Ensemble ───────────────────────────────────────────
## Ensemble of 2 Models (MEAN)
##
## # Modeltime Table
## # A tibble: 2 × 5
## .model_id .model .model_desc .type .calibration_data
## <int> <list> <chr> <chr> <list>
## 1 1 <workflow> GLMNET Test <tibble [31 × 4]>
## 2 4 <workflow> NNAR(1,10) Test <tibble [31 × 4]>
The accuracy results.
forecast_ens2 = forecast_results(mod_ensemble2)
plot_forecast_table(forecast_ens2$calibration)
| Accuracy Table | ||||||||
|---|---|---|---|---|---|---|---|---|
| .model_id | .model_desc | .type | mae | mape | mase | smape | rmse | rsq |
| 1 | ENSEMBLE (MEAN): 2 MODELS | Test | 1040.91 | 2.6 | 0.79 | 2.59 | 1264.53 | 0.77 |
In this case, the ensemble achieves almost the same accuracy as the single neural network.
plot_forecast(forecast_ens2$fc)
Finally, the last ensemble is composed of the XGBoost and the neural network.
set.seed(123)
idx3 = c(3, 4) # selection of models
mod_ensemble3 = mlForecast$calibration[idx3,] %>%
ensemble_average(type = "mean")
mod_ensemble3
## ── Modeltime Ensemble ───────────────────────────────────────────
## Ensemble of 2 Models (MEAN)
##
## # Modeltime Table
## # A tibble: 2 × 5
## .model_id .model .model_desc .type .calibration_data
## <int> <list> <chr> <chr> <list>
## 1 3 <workflow> XGBOOST Test <tibble [31 × 4]>
## 2 4 <workflow> NNAR(1,10) Test <tibble [31 × 4]>
forecast_ens3 = forecast_results(mod_ensemble3)
plot_forecast_table(forecast_ens3$calibration)
| Accuracy Table | ||||||||
|---|---|---|---|---|---|---|---|---|
| .model_id | .model_desc | .type | mae | mape | mase | smape | rmse | rsq |
| 1 | ENSEMBLE (MEAN): 2 MODELS | Test | 937.74 | 2.35 | 0.71 | 2.32 | 1158.56 | 0.83 |
Surprisingly, this ensemble outperforms the previous one, composed of the two best models, and achieves a better result than the single neural network.
plot_forecast(forecast_ens3$fc)
As can be seen, the model fits the data very well and the prediction intervals are very narrow.
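The simple mean is not the only combination available; modeltime.ensemble also offers weighted averages. A sketch of a hypothetical variant, not part of the original analysis, that tilts the weights toward the NNETAR model:

```r
# Sketch: weighted average of the XGBoost and NNETAR models, assuming the
# same calibration table and idx3 as above. Loadings are relative weights
# (here 1:2 in favor of NNETAR) rescaled internally to sum to one.
mod_ensemble3_w = mlForecast$calibration[idx3,] %>%
  ensemble_weighted(loadings = c(1, 2), scale_loadings = TRUE)
```

Evaluating this model with forecast_results, as before, would show whether the extra weight on the stronger base model pays off.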
After the whole analysis, several conclusions can be drawn regarding the data set and the models employed.
First, the evaluation procedure strongly influences the model choice. As we have seen, training and testing on a single split yields a performance figure that seems to clearly determine the optimal model. However, once cross-validation is applied and the models are evaluated over the whole set of splits, the decision becomes more difficult.
This behavior is expected, and it is precisely why we apply cross-validation in time series. As more data is provided to a model, it may behave better or worse than before. At this point, the decision should be guided by previous experience or by our goals, considering how much data should be used for training.
The neural network NNETAR, as well as the ensemble composed of this model and the XGBoost, obtains the best results in the test set of the first split, and its forecasting is very accurate. However, when the models are re-trained over all the splits, the simpler elastic net model obtains a better score. At this point we should consider which model we want to use, and whether it is worth using all the data or only the most recent part.
Consulting an expert in the topic would be the optimal approach, to find out whether old fluctuations or past seasonality may affect the future forecast; in that case, a model that performs better on average would be preferable to one that only excels on the most recent data. In the case of bitcoin's price, where the market is updated constantly, the daily frequency may even be too slow and data collected months ago may already be obsolete. For this reason, the best approach may be to use the model that behaves best on the most recent data: in our case, the ensemble composed of the NNETAR and the XGBoost.
final_forecast = forecast_results(mod_ensemble3, back=150)
plot_forecast(final_forecast$fc)
Second, consider how the variables are related to the response. Let us look again at the time series plots.
bitcoin_by_var %>%
ggplot(aes(x = date, y = value)) +
geom_line() +
labs(title = "Daily Values",
x = "", y="variables") +
facet_grid(vars(var), scales = "free_y") +
theme_bw()
At first glance, the variables trans and trans_cost seem to have a high correlation with the response price, since their trends are similar. In the case of volume the similarity is less clear. This could be related to the result obtained with the VAR model, in which only these two variables have lagged effects on the target.
Note also the weekly seasonality: the VAR model returns a 7-day lagged effect from price, and we have seen in the correlograms and the STL decomposition that the variables have a weekly seasonal pattern.
In addition, the inclusion of extra variables has improved the performance of the simple ARIMA model, and the dynamic regression with more predictors provides the best performance. For this reason, these features are useful for the prediction.